Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Jenny Nguyen, UID: 805947763

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.6.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.1       htmltools_0.5.7   yaml_2.3.8        rmarkdown_2.25   
 [9] knitr_1.45        jsonlite_1.8.8    xfun_0.41         digest_0.6.34    
[13] rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Display your machine memory.

memuse::Sys.meminfo()
Totalram:   16.000 GiB 
Freeram:   897.688 MiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer:

#define subject id
sid <- 10013310

#read in data files and filter rows for patient 10013310
transfers_10013310 <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == sid) 
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patients_10013310 <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == sid)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_10013310 <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == sid) |>
  #select first row since only need demographic information from this file
  slice_head()
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
labevents_10013310 <- arrow::open_dataset("labevents_pq/part-0.parquet", 
                                          format = "parquet") |> 
  filter(subject_id == sid) |>
  as_tibble()
d_icd_procedures_tble <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
procedures_icd_10013310 <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
  filter(subject_id == sid) |>
  #transform chartdate variable into POSIXct type 
  mutate(chartdate = as.POSIXct(chartdate, format = "%Y-%m-%d")) |>
  #join procedures dictionary to obtain procedure titles later
  left_join(d_icd_procedures_tble, by = c("icd_code", "icd_version"))
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses_tble <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagnoses_icd_10013310 <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == sid) |>
  #join diagnoses dictionary to obtain diagnoses titles later
  left_join(d_icd_diagnoses_tble, by = c("icd_code", "icd_version")) |>
  #group by each admission
  group_by(hadm_id) |>
  #arrange diagnoses in order for each admission
  arrange(seq_num, .by_group = TRUE) |>
  #remove grouping
  ungroup() |>
  #select rows with the first three diagnoses during the first admission
  slice(1:3)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#define title and subtitle
my_title <- paste(c("Patient", sid, ",", print(patients_10013310$gender), ",", 
                  print(patients_10013310$anchor_age), ",", 
                  print(admissions_10013310$race)), collapse = " ")
[1] "F"
[1] 70
[1] "BLACK/AFRICAN"
my_subtitle <- paste(print(diagnoses_icd_10013310$long_title), collapse = "\n")
[1] "Subsequent non-ST elevation (NSTEMI) myocardial infarction"
[2] "Acute on chronic systolic (congestive) heart failure"      
[3] "Other cardiomyopathies"                                    
#plot graph
ggplot() +
  #plot segment for ADT events 
  geom_segment(data = transfers_10013310,
               mapping = aes(
                 #plot transfer times on x axis 
                 x = intime, xend = outtime, 
                 #picked a random y value to place segment on 
                 y = 0.025, yend = 0.025, 
                 #set color according to care unit 
                 color = careunit),
               #set line width as 6 if care unit was ICU or CCU, and 2 else
               linewidth = ifelse(grepl("ICU|CCU", transfers_10013310$careunit), 
                             6, 2)) +
  #add new layer, plot points for lab events 
  geom_point(data = labevents_10013310,
             #plot chart time on x axis, picked y value less than y value for
             #ADT events
             mapping = aes(x = charttime, y = 0),
             #set point shape as crosses
             shape = 3) +
  #add new layer, plot points for procedure events 
  geom_point(data = procedures_icd_10013310,
             #plot chart date on x axis, picked y value less than y value for 
             #lab events, set shape according to procedure title
             mapping = aes(x = chartdate, y = -0.025, shape = long_title)) +
  #select shapes for each procedure (since default only provides 6)
  scale_shape_manual(values = c(15, 16, 17, 3, 6, 7, 8, 10, 13)) +
  #add title, subtitle, and axis/legend titles
  labs(title = my_title,
       subtitle = my_subtitle,
       x = "Calendar Time",
       y = " ",
       shape = "Procedure",
       color = "Care Unit") +
  #set background to white
  theme_bw() + 
  #place legend at bottom of plot, adjust legend size to fit in plot, adjust 
  #aspect ratio to make plot wider 
  theme(legend.position = "bottom",
        legend.text = element_text(size = 5, vjust = 0.5), 
        legend.key.width = unit(0.1, "cm"),
        legend.key.size = unit(0.3, "cm"),
        legend.spacing = unit(0.05, "cm"), 
        aspect.ratio = 1/4) +
  #adjust legend columns and rows to fit plot
  guides(shape = guide_legend(ncol = 1, nrow = 9),
         color = guide_legend(ncol = 1, nrow = 7)) +
  #set labels for type of event on y axis 
  scale_y_continuous(breaks = c(-0.025, 0, 0.025),
                     labels = c("Procedure", "Lab", "ADT")) +
  #set y axis limits for the plot
  coord_cartesian(ylim = c(-0.04, 0.04)) 
Warning: Removed 3 rows containing missing values (`geom_segment()`).

Above is my visualization for patient 10013310, with the codes and documentation. For the top 3 diagnoses, I selected the top 3 diagnoses from the first hospital admission. For lab events, I used charttime instead of storetime to document the timing of lab events. According to this plot, this patient has been at the ICU/CCU twice over the course of their hospital stays during May to July. They have had multiple lab events and several procedures throughout their stays.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer:

#read in data files and filter rows for patient 10013310
chartevents_10013310 <- arrow::open_dataset("chartevents_pq/part-0.parquet", 
                                            format = "parquet") |>
  filter(subject_id == sid) |>
  as_tibble()
icustays_10013310 <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  filter(subject_id == sid) |>
  #join icu stays with chart events by subject id, hadm id, and stay id
  left_join(chartevents_10013310, 
            by = c("subject_id", "hadm_id", "stay_id"), 
            relationship = "many-to-many") |>
  #filter by vitals that we are interested in
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210))
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#define title and labels 
my_title2 <- paste(c("Patient", sid, "ICU stays - Vitals"), collapse = " ")
ylabels <- c("220045" = "HR",
             "220179" = "NBPs",
             "220180" = "NBPd",
             "220210" = "RR",
             "223761" = "Temperature F")

#plot graph
ggplot(data = icustays_10013310,
       #plot chart time on x axis, measurement value on y axis, set color 
       #according to vital
       mapping = aes(x = charttime, y = valuenum, color = factor(itemid))) +
  #add lines, don't show legend
  geom_line(show.legend = FALSE) + 
  #add points, don't show legend
  geom_point(show.legend = FALSE) +
  #form panels based on vitals and icu stay
  facet_grid(itemid ~ stay_id,
             #allow scales to vary across vitals and icu stay
             scales = "free",
             #set vital labels according to previously defined labels 
             labeller = labeller(itemid = as_labeller(ylabels))) +
  #add title 
  labs(title = my_title2) +
  #set background to light 
  theme_light() + 
  #remove axis titles 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  #include more space between x axis text
  scale_x_datetime(guide = guide_axis(n.dodge = 2)) +
  #set y axis breaks 
  scale_y_continuous(n.breaks = 4)

Above is my visualization for patient 10013310, with the codes and documentation. According to this plot, this patient stayed at the ICU twice, once during early May and again during mid June. Their second stay was much longer than the first stay. They had a lot more vital measurements taken during their second stay, with a lot of variability in the measurements. It looks like they also had high heart rates and a body temperature reaching 103 F at one point during the second stay, perhaps indicating a more severe health condition than the first stay.

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer:

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  print()
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 2 more variables: outtime <dttm>, los <dbl>

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer:

icustays_tble |>
  distinct(subject_id) |>
  summarize(rows = n())
# A tibble: 1 × 1
   rows
  <int>
1 50920

There are 50,920 unique subject_id.

icustays_tble |>
  summarize(rows = n())
# A tibble: 1 × 1
   rows
  <int>
1 73181

Without specifying unique values for subject_id, there are 73,181 total rows, which is more than the 50,920 rows from specifying unique values for subject_id. Thus, a subject_id can have multiple ICU stays.

icustays_tble |>
  group_by(subject_id) |>
  summarize(count = n()) |>
  ggplot() +
  geom_bar(mapping = aes(x = count)) +
  labs(
    title = "Distribution of ICU stays of patients", 
    x = "# of ICU Stays",
    y = "Count"
  )

A majority of the patients have only one ICU stay. Some patients have 2-5 ICU stays, and a select few have 5+ ICU stays. The maximum number of ICU stays by a patient is ~30-40 ICU stays.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer:

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  print()
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime          
        <dbl>    <dbl> <dttm>              <dttm>             
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient

Answer:

admissions_tble |>
  group_by(subject_id) |>
  summarize(count = n()) |>
  ggplot() +
  geom_bar(mapping = aes(x = count)) +
  labs(
    title = "Distribution of admissions of patients", 
    x = "# of Admissions",
    y = "Count"
  )

A majority of the patients have only a couple of admissions. Some patients have more than a couple of admissions, and a select few have multiple admissions. The maximum number of admissions by a patient is ~200-250 admissions.

  • admission hour (anything unusual?)

Answer:

ggplot(data = admissions_tble) +
  geom_bar(mapping = aes(x = hour(admittime))) + 
  labs(
    title = "Distribution of the hour of admission", 
    x = "Hour of Admission",
    y = "Count"
  ) 

Yes, there is an unusually high number of admissions that occur at 0am and 7am, relative to the other hours in the morning. Aside from these two unusual hours, admissions are generally low in the morning; however, admissions start to increase in the early afternoon and stay at high numbers during the night.

  • admission minute (anything unusual?)

Answer:

ggplot(data = admissions_tble) +
  geom_bar(mapping = aes(x = minute(admittime))) + 
  labs(
    title = "Distribution of the minute of admission", 
    x = "Minute of Admission",
    y = "Count"
  ) 

Yes, there is an unusually high number of admissions that occur at the 0 minute, 15 minute, 30 minute, and 45 minute. This could be due workers rounding to the nearest quarter of the hour when recording the admission time. Aside from these unusual minutes, all other minutes have similar numbers of admissions.

  • length of hospital stay (from admission to discharge) (anything unusual?)

Answer:

ggplot(data = admissions_tble) +
  geom_bar(mapping = aes(x = (as.duration(admittime %--% dischtime) / 86400))) + 
  labs(
    title = "Distribution of hospital stay length", 
    x = "Hospital Stay Length (Days)",
    y = "Count"
  ) 

Yes, the x axis of the graph extends to 300 days, indicating that there is an unusually long hospital stay that lasted around 300 days. Most hospital stays are relatively short, spanning a day to a few days. However, there do seem to be some hospital stays that are longer, spanning several days to weeks.

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer:

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  print()
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer:

ggplot(data = patients_tble) +
  geom_boxplot(mapping = aes(x = gender, y = anchor_age)) + 
  labs(
    title = "Distribution of patient age by patient gender", 
    x = "Gender",
    y = "Age"
  ) 

ggplot(data = patients_tble) +
  geom_histogram(binwidth = 2, 
                 mapping = aes(x = anchor_age, fill = gender),
                 position = "dodge") + 
  labs(
    title = "Distribution of patient age by patient gender",
    x = "Age",
    y = "Count",
    fill = "Gender"
  ) 

The median age of male patients is slightly higher than the median age of female patients. In general, there is a larger share of young adult patients compared to adult and elderly patients, for both genders. This observation is surprising, as we would assume the opposite with more elderly patients and less young adult patients. However, the oldest elderly patients seem to make up a sizable share of the total patients, likely due to worsening health conditions. Across young adult and elderly patients, there is a larger share of female patients, while across older adult patients, there is a larger share of male patients. This could be due to multiple factors or a combination of factors, e.g. women may be more likely to seek out medical care or come in more often for female-specific health conditions that are critical at younger and older ages.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer:

#read in lab events data file 
labevents_subset <- arrow::open_dataset("labevents_pq/part-0.parquet", 
                                        format = "parquet") |>
  #select necessary columns
  select(subject_id, itemid, storetime, valuenum) |>
  #filter by lab measurements we are interested in 
  filter(itemid %in% c(50912, 50971, 50983, 50902, 
                       50882, 51221, 51301, 50931)) |>
  #turn data into a tibble
  as_tibble()

#create a subset of icu stays tibble 
icustays_subset <- icustays_tble |> 
  #select necessary columns
  select(subject_id, stay_id, intime)

#join subsets of icu stays and lab events together by subject id
labevents_tble <- left_join(icustays_subset, labevents_subset, 
                            by = "subject_id", 
                            relationship = "many-to-many") |>
  #calculate the duration between store time and in time in new column span
  mutate(span = as.duration(storetime - intime)) |> 
  #group data by subject id, then stay id, then item id
  group_by(subject_id, stay_id, itemid) |>
  #arrange data by span (shortest to longest) for each group
  arrange(span, .by_group = TRUE) |>
  #select rows where lab measurements were taken before transfer to icu (when
  #store time is before in time, span is negative)
  filter(span < 0) |>
  #for each group, select the last row (the most recent measurement before icu 
  #stay)
  slice_tail() |>
  #remove columns that are no longer needed 
  select(-c(intime, storetime, span)) |>
  #turn data into wide format 
  pivot_wider(names_from = itemid, values_from = valuenum) |>
  #rename columns based on correct lab measurement name 
  rename(
    creatinine = "50912",
    potassium = "50971",
    sodium = "50983",
    chloride = "50902",
    bicarbonate = "50882",
    hematocrit = "51221",
    wbc = "51301",
    glucose = "50931"
  ) |>
  #remove grouping 
  ungroup() |>
  print()
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer:

#read in chart events data file
chartevents_subset <- arrow::open_dataset("chartevents_pq/part-0.parquet", 
                                          format = "parquet") |>
  #select necessary columns
  select(subject_id, stay_id, charttime, itemid, valuenum) |>
  #filter by vitals we are interested in
  filter(itemid %in% c(220045, 220179, 220180, 223761, 220210)) |>
  #turn data into a tibble
  as_tibble()

#join subsets of icu stays and chart events together by subject id and stay id
chartevents_tble <- left_join(chartevents_subset, icustays_subset, 
                              by = c("subject_id", "stay_id"), 
                              relationship = "many-to-many") |> 
  #calculate the duration between chart time and in time in new column span
  mutate(span = as.duration(charttime - intime)) |>
  #group data by subject id, then stay id, then item id
  group_by(subject_id, stay_id, itemid) |>
  #arrange data by span (shortest to longest) for each group
  arrange(span, .by_group = TRUE) |>
  #for each group, select the first row (the first vital measurement in icu
  #stay)
  slice_head() |> 
  #remove columns that are no longer needed 
  select(-c(intime, charttime, span)) |>
  #turn data into wide format
  pivot_wider(names_from = itemid, values_from = valuenum) |>
  #rename columns based on correct vital name
  rename(
    heart_rate = "220045",
    non_invasive_blood_pressure_systolic = "220179",
    non_invasive_blood_pressure_diastolic = "220180",
    temperature_fahrenheit = "223761",
    respiratory_rate = "220210"
  ) |> 
  #remove grouping
  ungroup() |>
  print()
# A tibble: 73,164 × 7
   subject_id  stay_id heart_rate non_invasive_blood_pr…¹ non_invasive_blood_p…²
        <dbl>    <dbl>      <dbl>                   <dbl>                  <dbl>
 1   10000032 39553978         91                      84                     48
 2   10000980 39765666         77                     150                     77
 3   10001217 34592300         96                     167                     95
 4   10001217 37067082         86                     151                     90
 5   10001725 31205490         55                      73                     56
 6   10001884 37510196         38                     180                     12
 7   10002013 39060235         80                     104                     70
 8   10002155 31090461         94                     118                     51
 9   10002155 32358465         98                     109                     65
10   10002155 33685454         68                     126                     61
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​non_invasive_blood_pressure_systolic,
#   ²​non_invasive_blood_pressure_diastolic
# ℹ 2 more variables: respiratory_rate <dbl>, temperature_fahrenheit <dbl>

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer:

#join icu stays and admissions tibbles together by subject id and hadm id
mimic_icu_cohort <- left_join(icustays_tble, admissions_tble, 
                              by = c("subject_id", "hadm_id")) |>
  #join patients tibble to the previous join by subject id
  left_join(patients_tble, by = "subject_id") |>
  #join lab events tibble to the previous join by subject id and stay id
  left_join(labevents_tble, by = c("subject_id", "stay_id")) |>
  #join chart events tibble to the previous join by subject id and stay id
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) |>
  #extract year from in time, calculate difference between in time year and 
  #anchor year, add the difference onto anchor age to find age at in time 
  mutate(intime_year = year(intime),
         year_diff = intime_year - anchor_year,
         intime_age = anchor_age + year_diff) |>
  #filter by rows where patient's age at in time is 18 years or older
  filter(intime_age >= 18) |>
  #remove columns that are no longer needed
  select(-c(intime_year, year_diff)) |>
  print()
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 35 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, gender <chr>, anchor_age <dbl>, …

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

Answer:

mimic_icu_cohort |>
  group_by(race) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = fct_reorder(as.factor(race), total_los), 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by patient race", 
    x = "Race",
    y = "Total ICU Stay Length (Fractional Days)"
  ) +
  coord_flip()

White patients have the longest total ICU stay, followed by patients with an unknown race and Black/African American patients. Hispanic/Latino Central American patients have the shortest total ICU stay.

mimic_icu_cohort |>
  group_by(insurance) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = fct_reorder(as.factor(insurance), total_los), 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by patient's insurance", 
    x = "Insurance",
    y = "Total ICU Stay Length (Fractional Days)"
  ) +
  coord_flip()

Patients with insurance other than Medicare and Medicaid have the longest total ICU stay, closely followed by patients with Medicare insurance. Compared to patients with Medicare or other insurance, patients with Medicaid have a shorter total ICU stay.

mimic_icu_cohort |>
  group_by(marital_status) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = fct_reorder(as.factor(marital_status), total_los), 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by patient's marital status", 
    x = "Marital Status",
    y = "Total ICU Stay Length (Fractional Days)"
  ) +
  coord_flip()

Married patients have the longest total ICU stay, followed by single patients. Compared to married or single patients, widowed and divorced patients as well as patients with missing marital status data have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(gender) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = fct_reorder(as.factor(gender), total_los), 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by patient gender", 
    x = "Gender",
    y = "Total ICU Stay Length (Fractional Days)"
  ) +
  coord_flip()

Male patients have a longer total ICU stay, compared to female patients.

mimic_icu_cohort |>
  group_by(intime_age) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = intime_age, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by patient age", 
    x = "Age",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 

Young adult patients have shorter total ICU stays. The oldest elderly patients also have shorter total ICU stays, which could be due to fewer patients who live up to these ages. Adults aged 50-90 years tend to have longer total ICU stays, due to declining health with older age.

  • Length of ICU stay los vs the last available lab measurements before ICU stay

Answer:

mimic_icu_cohort |>
  group_by(bicarbonate) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = bicarbonate, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by bicarbonate measurement", 
    x = "Bicarbonate (mEq/L)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with bicarbonate levels between ~20-30 mEq/L measured right before their ICU stay have longer total ICU stays, while patients with very low or very high bicarbonate levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(chloride) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = chloride, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by chloride measurement", 
    x = "Chloride (mEq/L)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with chloride levels ~100 mEq/L have the longest total ICU stays, while patients with very low or very high chloride levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(creatinine) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = creatinine, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by creatinine measurement", 
    x = "Creatinine (mg/dL)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with very low creatinine levels have longer total ICU stays, while patients with higher creatinine levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(glucose) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = glucose, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by glucose measurement", 
    x = "Glucose (mg/dL)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with low glucose levels have longer total ICU stays, while patients with higher glucose levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(potassium) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = potassium, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by potassium measurement", 
    x = "Potassium (mEq/L)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with potassium levels ~4 mEq/L have the longest total ICU stays, while patients with lower or higher potassium levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(sodium) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = sodium, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by sodium measurement", 
    x = "Sodium (mEq/L)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with sodium levels ~140 mEq/L have the longest total ICU stays, while patients with lower and higher sodium levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(hematocrit) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = hematocrit, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by hematocrit measurement", 
    x = "Hematocrit (%)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with hematocrit levels between ~30-45% have longer total ICU stays, while patients with lower and higher hematocrit levels have shorter total ICU stays.

mimic_icu_cohort |>
  group_by(wbc) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = wbc, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by white blood cell count", 
    x = "White Blood Cell Count",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with very low white blood cell counts have longer total ICU stays, while patients with higher white blood cell counts have shorter total ICU stays.

  • Length of ICU stay los vs the first vital measurements within the ICU stay

Answer:

ggplot(data = mimic_icu_cohort) +
  geom_point(mapping = aes(x = heart_rate, y = los),
             position = "jitter") +
  labs(
    title = "Distribution of ICU stay length by heart rate",
    x = "Heart Rate (bpm)",
    y = "ICU Stay Length (Fractional Days)"
  )
Warning: Removed 18 rows containing missing values (`geom_point()`).

mimic_icu_cohort |>
  group_by(heart_rate) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = heart_rate, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by heart rate", 
    x = "Heart Rate (bpm)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) + 
  coord_cartesian(xlim = c(0, 250))
Warning: Removed 1 rows containing missing values (`position_stack()`).

The first plot shows one observation with an extremely high heart rate (possibly due to a recording error?), which masks the variability in the distribution. The second plot ignores this observation, zooming into the distribution of heart rates between 0-250 bpm. In the second plot, we can see that patients with heart rates between ~75-100 bpm have the longest total ICU stays, compared to patients with lower or higher heart rates.

ggplot(data = mimic_icu_cohort) +
  geom_point(mapping = aes(x = non_invasive_blood_pressure_systolic, 
                           y = los),
             position = "jitter") +
  labs(
    title = "Distribution of ICU stay length by systolic non-invasive blood pressure",
    x = "Systolic Non-invasive Blood Pressure (mmHg)",
    y = "ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 950 rows containing missing values (`geom_point()`).

mimic_icu_cohort |>
  group_by(non_invasive_blood_pressure_systolic) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = non_invasive_blood_pressure_systolic, 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by systolic non-invasive blood pressure", 
    x = "Systolic Non-invasive Blood Pressure (mmHg)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) + 
  coord_cartesian(xlim = c(0, 275))
Warning: Removed 1 rows containing missing values (`position_stack()`).

The first plot shows two observation with extremely high systolic non-invasive blood pressures (possibly due to recording errors?), which masks the variability in the distribution. The second plot ignores these observations, zooming into the distribution of systolic non-invasive blood pressures between 0-275 mmHg. In the second plot, we can see that patients with systolic non-invasive blood pressures between ~100-125 mmHg have the longest total ICU stays, compared to patients with lower or higher blood pressures.

ggplot(data = mimic_icu_cohort) +
  geom_point(mapping = aes(x = non_invasive_blood_pressure_diastolic, 
                           y = los),
             position = "jitter") +
  labs(
    title = "Distribution of ICU stay length by diastolic non-invasive blood pressure",
    x = "Diastolic Non-invasive Blood Pressure (mmHg)",
    y = "ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 953 rows containing missing values (`geom_point()`).

mimic_icu_cohort |>
  group_by(non_invasive_blood_pressure_diastolic) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = non_invasive_blood_pressure_diastolic, 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by diastolic non-invasive blood pressure", 
    x = "Diastolic Non-invasive Blood Pressure (mmHg)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) + 
  coord_cartesian(xlim = c(0, 225))
Warning: Removed 1 rows containing missing values (`position_stack()`).

The first plot shows several observations with extremely high diastolic non-invasive blood pressures (possibly due to recording errors?), which masks the variability in the distribution. The second plot ignores these observation, zooming into the distribution of diastolic non-invasive blood pressures between 0-225 mmHg. In the second plot, we can see that patients with diastolic non-invasive blood pressures between ~50-75 mmHg have the longest total ICU stays, compared to patients with lower or higher blood pressures.

mimic_icu_cohort |>
  group_by(respiratory_rate) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = respiratory_rate, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by respiratory rate", 
    x = "Respiratory Rate (insp/min)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1 rows containing missing values (`position_stack()`).

Patients with respiratory rates ~20 insp/min have the longest total ICU stays, while patients with lower or higher respiratory rates tend to have shorter ICU stays.

ggplot(data = mimic_icu_cohort) +
  geom_point(mapping = aes(x = temperature_fahrenheit, y = los),
             position = "jitter") +
  labs(
    title = "Distribution of ICU stay length by body temperature",
    x = "Body Temperature (degrees Fahrenheit)",
    y = "ICU Stay Length (Fractional Days)"
  ) 
Warning: Removed 1249 rows containing missing values (`geom_point()`).

mimic_icu_cohort |>
  group_by(temperature_fahrenheit) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = temperature_fahrenheit, y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by body temperature", 
    x = "Body Temperature (degrees Fahrenheit)",
    y = "Total ICU Stay Length (Fractional Days)"
  ) + 
  coord_cartesian(xlim = c(0, 105))
Warning: Removed 1 rows containing missing values (`position_stack()`).

The first plot shows two observations with extremely high body temperatures (possibly due to recording errors?), which masks the variability in the distribution. The second plot ignores these observation, zooming into the distribution of diastolic non-invasive blood pressures between 0-105 degrees Fahrenheit. In the second plot, we can see that most observations are clustered around the average human body temperature, with the longest total ICU stays observed for patients with body temperatures slightly below 100 degrees Fahrenheit. Unusually, there are observations with extremely low body temperatures as well, as seen from the first plot.

  • Length of ICU stay los vs first ICU unit

Answer:

mimic_icu_cohort |>
  group_by(first_careunit) |>
  summarize(total_los = sum(los)) |>
  ggplot() +
  geom_col(mapping = aes(x = fct_reorder(as.factor(first_careunit), total_los), 
                         y = total_los)) +
  labs(
    title = "Distribution of ICU stay length by first ICU unit", 
    x = "First ICU Unit",
    y = "Total ICU Stay Length (Fractional Days)"
  ) +
  coord_flip()

Patients transferred to the Medical Intensive Care Unit (MICU) have the longest total ICU stay, followed by patients transferred to the Surgical Intensive Care Unit (SICU) and the Medical/Surgical Intensive Care Unit (MICU/SICU). Patients transferred to Neuro Stepdown have the shortest total ICU stay.